Abstract

We perform extensive Monte Carlo simulations of a lattice model and the Go potential
to investigate the existence of folding pathways at the level of contact cluster formation
for two native structures with markedly different geometries. Our analysis of folding
pathways revealed a common underlying folding mechanism, based on nucleation phenomena,
for both protein models. However, folding to the more complex geometry (i.e. that with
more non-local contacts) is driven by a folding nucleus whose geometric traits more
closely resemble those of the native fold. For this geometry folding is clearly a more
cooperative process.



1 Introduction 

Protein folding is the process by which a linear chain of amino acids spontaneously acquires a specific
three-dimensional native structure [1]. As pointed out by Levinthal in the late 1960s, a random
search of the conformational space for the global minimum of the free energy (i.e. for the unique
native fold) is not compatible with the biological timeframe of folding [2]. This raised the hypothesis
that folding might have to occur through an ordered sequence of events (i.e. an ordered sequence
of conformational changes) for the protein to rapidly reach its native conformation when starting
from the unfolded state. In other words, kinetic pathways of folding, with or without specific
intermediates, were envisaged to explain the timescale of protein folding [2, 3].
The discovery in the early 1990s that the 64-residue protein chymotrypsin inhibitor 2 (CI2)
folds rapidly with single-exponential (two-state) kinetics [3] showed that the existence of discrete
folding intermediates is not a prerequisite for fast folding. Indeed, the vast majority of small (fewer
than 120 amino acids), single domain proteins are, like CI2, rapid two-state folders [5]. Another
'simplifying' feature of small proteins is their topology-dependent folding kinetics: the contact order,
CO [6], which measures the average sequence separation of contacting residues in the native fold, and
other related metrics of native geometry [7, 8] are strongly correlated with folding rates, suggesting
that native topology plays a key role in determining the folding mechanism.

A protein engineering method termed φ-value analysis [9] revealed that CI2 folds via
nucleation-condensation (NC) [10], a mechanism that was first observed by Shakhnovich and collaborators in
the context of simulations of lattice proteins [11]. In the NC mechanism the formation of a small set
of local native bonds, stabilized by a few non-local native interactions, the so-called folding nucleus,
triggers the rapid emergence of the native fold. Subsequent studies suggested that NC is possibly
the most common folding mechanism amongst single domain proteins [12].

The problem of identifying folding pathways along with the formation of folding nuclei is therefore
of the utmost importance in protein chemistry and has been investigated within different frameworks
[15, 22]. Computer simulations of protein folding and unfolding, both on- and off-lattice,
have proved particularly useful in exploring protein folding pathways and mechanisms at different
levels of structural detail [17, 18]. For example, at the micro-structural level of contact formation
it was shown that folding is dominated by a well-defined
sequence of events [15] and that the sequencing of events depends primarily on the native geometry
as defined by the CO [14]. On the other hand, a more recent study revealed that the unfolding
process of CI2 happens in a highly parallel fashion [2BJ. At a coarser level of structure defined by 
contact clusters (i.e. secondary structure elements) sequential folding events have been reported 
within different simulational frameworks [141 1261 |2T1 1171 120] , 

Here we use a lattice model and the Go potential to explore in some detail the folding pathways 
leading to different native geometries. In particular, we determine the order according to which 
different sections of the native fold become structured as folding progresses toward the native state. 
For both geometries there is one section that exhibits a distinctively different folding pattern. Moreover,
the timely formation of this particular section determines the most probable folding pathways.
By comparison with previous studies, based on specific strategies to identify the folding nucleus, 
we have confirmed that this unique section, identified through the analysis of the folding pathways, 
does contain the critical contacts forming the folding nucleus. 

This article is organized in the following way. In the next section we describe the model and 
simulational methods employed, then we present and discuss the results of the simulations, and 
finally we draw some concluding remarks. 



2 Model and Methods

2.1 Go model and simulation details

We consider a simple three-dimensional lattice model of a protein molecule with chain length N = 48.
In such a minimalist model amino acids, represented by beads of uniform size, occupy the lattice
vertices, and the peptide bond that covalently connects amino acids along the polypeptide chain is
represented by sticks of uniform (unit) length corresponding to the lattice spacing.

To mimic protein energetics we use the Go model [27]. In the Go model the energy of a conformation,
defined by the set of bead coordinates {r_i}, is given by the contact Hamiltonian

$$ H(\{\vec{r}_i\}) = \sum_{i>j}^{N} \epsilon \, \Delta(\vec{r}_i - \vec{r}_j), \qquad (1) $$

where the contact function Δ(r_i − r_j) is unity only if beads i and j form a non-covalent native
contact, i.e., a contact between a pair of beads that is present in the native structure, and is zero
otherwise. The Go potential is based on the idea that the native fold is very well optimized energetically.
Accordingly, it ascribes equal stabilizing energies (e.g., ε = −1.0) to all the native contacts
and neutral energies (ε = 0) to all non-native contacts. As the Go model has a uniform distribution
of contact energies, the folding dynamics driven by the Go potential is essentially determined by the
structural features of the native fold.

Geometry   ACO    Fraction LR   T_opt   Folding time (×10^6 MCS)
1          21.4   0.74          0.65    8.1 ± 0.5
2          10.0   0.33          0.66    2.3 ± 0.1

Table 1: Absolute contact order (ACO), fraction of long-range (LR) contacts, optimal folding temperature,
T_opt, and folding time for geometries 1 and 2.
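As an illustration of the contact Hamiltonian, the following minimal Python sketch evaluates the Go energy and the fraction of native contacts Q for a chain on the cubic lattice. The function names (`lattice_contacts`, `go_energy`, `fraction_native`) and the representation of the native fold as a set of index pairs are assumptions made for this example; they are not taken from the original simulation code.

```python
from typing import List, Set, Tuple

Coord = Tuple[int, int, int]
Contact = Tuple[int, int]


def lattice_contacts(coords: List[Coord]) -> Set[Contact]:
    """Non-covalent contacts: beads (i, j), not bonded, one lattice spacing apart."""
    found = set()
    for i in range(len(coords)):
        for j in range(i + 2, len(coords)):
            # Manhattan distance 1 means the two beads sit on neighbouring vertices.
            dist = sum(abs(a - b) for a, b in zip(coords[i], coords[j]))
            if dist == 1:
                found.add((i, j))
    return found


def go_energy(coords: List[Coord], native: Set[Contact], eps: float = -1.0) -> float:
    """Go energy: eps for every native contact that is formed, zero otherwise."""
    return eps * len(lattice_contacts(coords) & native)


def fraction_native(coords: List[Coord], native: Set[Contact]) -> float:
    """Fraction of native contacts, Q = q / L."""
    return len(lattice_contacts(coords) & native) / len(native)
```

For a toy four-bead chain folded into a square, `lattice_contacts` returns the single contact (0, 3); taking that square as the native fold, a fully stretched chain has zero energy and Q = 0.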

In order to mimic the protein's relaxation towards the native state we use a Metropolis Monte
Carlo (MC) algorithm [28, 30, 31] together with the kink-jump move set [25]. To guarantee that
the detailed balance condition is satisfied, the probability of a certain conformational change must
be independent of the conformation adopted by the chain [30, 31]. Therefore, at each MC step, the
probability of applying the Metropolis criterion to a particular chain displacement is 0.2/(N + 6) if
the change involves moving one single bead (end-move and corner-flip), or 0.8/(N − 3) if it involves
the simultaneous movement of two beads (crankshaft). An MC simulation starts from a randomly
generated unfolded conformation and the folding dynamics is monitored by following the evolution
of the fraction of native contacts, Q = q/L, where L is the number of contacts in the native
fold and q is the number of native contacts formed at each MC step. The number of MC steps
required to fold to the native state (i.e., to achieve Q = 1.0) is the first passage time (FPT) and
the folding time is computed as the mean first passage time (MFPT) of 300 simulations. Folding
is studied at the so-called optimal folding temperature, T_opt, the temperature that minimizes the
folding time [33].
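The acceptance rule and the folding-time estimate described above can be sketched as follows. This is a minimal illustration under stated assumptions: energies and temperature are in units where the Boltzmann constant is 1, the move set itself is not implemented, and the function names are hypothetical.

```python
import math
import random
from statistics import mean
from typing import Sequence


def metropolis_accept(delta_e: float, temperature: float, rng: random.Random) -> bool:
    """Standard Metropolis criterion: always accept downhill moves,
    accept uphill moves with probability exp(-delta_e / T)."""
    if delta_e <= 0.0:
        return True
    return rng.random() < math.exp(-delta_e / temperature)


def mean_first_passage_time(first_passage_times: Sequence[float]) -> float:
    """Folding time as the mean of the first passage times of independent runs."""
    return mean(first_passage_times)
```

In a simulation loop, `delta_e` would be the change in Go energy produced by a trial kink-jump move, and each run would record the MC step at which Q first reaches 1.0.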

2.2 Native Geometries 

In order to explore how native geometry alone drives the folding process, two native folds (Figure 1),
which are amongst the most complex (Geometry 1) and the simplest (Geometry 2) cuboid geometries
found through lattice simulations of homopolymer relaxation [36], were considered in this study. For
structures that, like ours, are maximally compact cuboids with N = 48 residues, there are 57 native
contacts. A non-local contact between two residues i and j is considered long-range (LR) if its
sequence separation is at least 12 units, i.e. |i − j| ≥ 12 [7]. Geometry 1 is characterized by a large
number of LR contacts while in Geometry 2 the native bonds are predominantly local (Figure 1 and
Table 1). The larger number of LR contacts in Geometry 1 translates into a large absolute contact
order (ACO) [6].
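The two geometric descriptors used here follow directly from the list of native contacts. The sketch below assumes that the native fold is given as a collection of residue index pairs (i, j) and that the ACO is the plain average of |i − j| over native contacts, as described in the Introduction; the function names are illustrative.

```python
from typing import Iterable, Tuple

Contact = Tuple[int, int]


def absolute_contact_order(native: Iterable[Contact]) -> float:
    """Average sequence separation |i - j| over all native contacts."""
    pairs = list(native)
    return sum(abs(i - j) for i, j in pairs) / len(pairs)


def fraction_long_range(native: Iterable[Contact], min_separation: int = 12) -> float:
    """Fraction of native contacts with sequence separation of at least min_separation."""
    pairs = list(native)
    return sum(1 for i, j in pairs if abs(i - j) >= min_separation) / len(pairs)
```

For example, the three contacts (0, 3), (0, 15) and (2, 5) give an ACO of 7.0 and a long-range fraction of 1/3.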

2.3 Probability to fold, P_fold

The folding probability, P_fold(Γ), of a conformation Γ is defined as the fraction of MC runs
which, starting from Γ, fold before they unfold [37]. To compute P_fold we use an ensemble of 500
MC runs divided into bins of 100 runs. P_fold is first computed for each bin; the values thus
found are subsequently averaged and the respective standard deviation evaluated. Each MC run
stops when either the native fold or some unfolded conformation is reached. A conformation is
deemed unfolded when its total fraction of native contacts, Q, is smaller than some cut-off, Q_U.
In order to estimate Q_U we compute the probability of finding some fraction of native contacts
Q as a function of Q in 200 MC folding runs (Figure 2). Sufficiently small Q must necessarily
identify unfolded (or denatured) conformations. Indeed, a high-probability peak, centered around
the fraction of native contacts Q_U = 0.1, is readily apparent in the graph reported for Geometry 1
(Figure 2, left). Similarly, the highest probability peak appears around Q_U = 0.2 for Geometry 2
(Figure 2, right). These fractions of native contacts are relatively low and therefore identify states
with minimal residual structure. In this work we use these values of Q to define each geometry's
cut-off value Q_U.

Figure 1: Three-dimensional representation of Geometry 1 (top, left) and Geometry 2 (bottom, left),
and their respective contact maps (right). In the contact maps each circle represents a native contact.
Non-local LR contacts are shown in white.

Figure 2: Probability distribution for the fraction of native contacts, Q, for Geometry 1 (left) and
Geometry 2 (right) as a function of Q. A conformation is considered unfolded when Q < Q_U.
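The binned estimate of P_fold described in this section can be sketched as below. The callable `run_once` stands for one MC run started from the conformation of interest and must return True when the run reaches the native fold before it unfolds; it is a placeholder for the actual lattice simulation. The function `toy_run` is a hypothetical stand-in (an unbiased random walk in the number of native contacts) included only so that the sketch can be executed.

```python
import random
from statistics import mean, stdev
from typing import Callable, Tuple


def estimate_pfold(run_once: Callable[[random.Random], bool],
                   n_runs: int = 500, bin_size: int = 100,
                   seed: int = 0) -> Tuple[float, float]:
    """Return (mean, standard deviation) of P_fold over bins of bin_size runs."""
    rng = random.Random(seed)
    bin_values = []
    for _ in range(n_runs // bin_size):
        folded = sum(run_once(rng) for _ in range(bin_size))
        bin_values.append(folded / bin_size)
    return mean(bin_values), stdev(bin_values)


def toy_run(rng: random.Random, q_start: int = 30, n_native: int = 57,
            q_unfolded: float = 0.1) -> bool:
    """Toy dynamics, NOT the lattice model: the number of native contacts
    performs an unbiased walk until Q = 1 (folded) or Q < q_unfolded (unfolded)."""
    q = q_start
    while q < n_native and q / n_native >= q_unfolded:
        q += 1 if rng.random() < 0.5 else -1
    return q == n_native
```

Calling `estimate_pfold(toy_run)` returns the bin-averaged folding probability of the toy walker together with the standard deviation over the five bins.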



3 Exploring the hidden 'architecture' within a lattice protein

In real globular proteins native contacts are clustered into the so-called secondary structure elements
(α-helices, β-sheets, etc.), which have no direct analogue on the lattice. Therefore, in this
coarse-grained representation, there are no well-defined clusters of contacts associated with
secondary structural elements. Nevertheless, it is possible to identify clusters of contacts in lattice
proteins that form well-defined sections of the native fold. We have developed a method (based on
inter-residue contact correlation analysis) that groups native contacts into distinct protein sections
according to whether their presence is strongly correlated.

3.1 Target conformations 

The first step in the proposed procedure is that of selecting an ensemble of appropriate target
conformations. These must be considerably native-like and, most importantly, committed to fold.
In order to find such productive conformers we ran 8000 MC simulations for each model geometry
and sampled a conformation from each independent MC run when folding was near completion (i.e.
at a time close to the run's FPT). Conformations thus selected are dynamically uncorrelated and
provide a sample of statistically independent elements. For every conformation we have computed
P_fold, along with its standard deviation. The time-to-fold, t_f, was then measured for conformations
with P_fold > 0.9. To compute t_f we have only considered MC runs in which the protein folds
before it unfolds. A priori one would expect such high-P_fold conformations to be kinetically very
close to the native state. However, for Geometry 1, a plot of the dependence of t_f on the folding
probability reveals the existence of many conformations with P_fold > 0.98 that find the native
state in a timeframe comparable with that observed in simulations starting from random coil-type
conformations (Figure 3, Table 1). These are trapped states (i.e. off-pathway to folding) and are
eliminated from the initial sample. Indeed, the vast majority (i.e. 73%) of high-P_fold conformers
find the native state in less than 6% of the folding time. These conformations are on the folding
pathway and can be used to cluster native bonds and shed light on the existence of putative
protein sections. The mean fraction of native contacts in these conformations is ⟨Q⟩ = 0.73 and,
on average, they differ by 10.31 native contacts.

Figure 3: Time-to-fold, t_f, as a function of the reaction coordinate, P_fold. Long-lived trapped states are
observed in Geometry 1 (left) at very high P_fold, but are absent in Geometry 2 (right). To measure t_f of
each conformation we considered only folding events in which the protein folded before unfolding; t_f is
the mean time-to-fold averaged over these folding events. The horizontal black lines indicate the cut-off
times below which a conformation is committed to fold. For Geometry 1 and 2 there are respectively
4724 and 4162 conformers with P_fold > 0.9.

For Geometry 2, 95% of conformations with P_fold > 0.9 find the native state in less
than 17% of the folding time. On average they differ by 9.18 contacts and, as in Geometry 1, their
mean fraction of native contacts is ⟨Q⟩ = 0.73.
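The selection of target conformations and the average number of contacts by which they differ can be sketched as follows. Two assumptions are made here and are not stated in the text: each conformer is stored as a tuple (P_fold, t_f, set of formed native contacts), and two conformations "differ" by the size of the symmetric difference of their contact sets. The function names are illustrative.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, List, Tuple

Conformer = Tuple[float, float, FrozenSet[int]]


def select_committed(conformers: Iterable[Conformer],
                     pfold_min: float, tf_max: float) -> List[FrozenSet[int]]:
    """Keep the contact sets of conformers with P_fold > pfold_min and t_f < tf_max."""
    return [contacts for pfold, tf, contacts in conformers
            if pfold > pfold_min and tf < tf_max]


def mean_pairwise_difference(contact_sets: List[FrozenSet[int]]) -> float:
    """Average, over all pairs, of the number of contacts present in one
    conformation but not in the other (symmetric difference)."""
    total = 0
    count = 0
    for a, b in combinations(contact_sets, 2):
        total += len(a ^ b)
        count += 1
    return total / count
```

Here `tf_max` plays the role of the cut-off time marked by the horizontal lines in Figure 3.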

3.2 Inter-residue contact correlation analysis reveals distinct protein sections

In conformations with high P_fold that are committed to fold one expects that protein sections,
comprising groups of correlated native bonds, will be formed with considerably high probability.
We say that two native contacts α and β are correlated, i.e., that they belong to the same section, if
(i) they have similar probabilities of being present when an arbitrary third contact γ is not, and (ii)
the probability of contact γ being present if contact α is not is similar to the probability of γ being
present if contact β is absent. Formally, conditions (i) and (ii) may be quantified by the correlation
between α and β, C_αβ, defined as

a0 = (^-2)^ + E^ (2) 

being ≪ 1. In the expression above p_αγ is the conditional probability of finding contact γ if contact
α is not present, n_α is the number of conformations in the sample where contact α is not present,
and L = 57 is the total number of native contacts. The error associated with p_αγ is of the order of
1/√n_α. Therefore the weight of each averaged term in (2), either n_γ or (n_α + n_β)/2, implies that
C_αβ is determined by the terms which are measured with the highest accuracy (this is an important
point since the measurement error associated with the difference between probabilities p_γα and p_γβ
increases as n_γ decreases). Using equation (2) the correlation between pairs of native contacts α and
β is computed in the ensembles of target conformations selected for Geometry 1 and Geometry 2,
and native contacts are ordered according to their relative correlations in the following way: starting
with an arbitrary contact, say contact 0, contact 1 is the one with the lowest C_01, i.e., the contact
that is the most strongly correlated with contact 0, contact 2 is that with the lowest C_12, and so
on. This ordering method sheds light on existing protein sections since it block-diagonalizes the
contact matrix, C. Indeed, density plots for the probability that contact α is present if contact β
is not, p_αβ, and for the fraction of conformations in the sample satisfying the same condition, n_αβ,
reveal the existence of three protein sections, namely sections A, B and C, in the two model proteins
(Figure 4).

Name           Number of contacts   Fraction LR   ACO
Geometry 1
  Section A    17                   0.94          30.8
  Section B    14                   0.79          24.1
  Section C    21                   0.52          13.0
Geometry 2
  Section A    22                   0.45          10.9
  Section B    17                   0.24          7.1
  Section C    16                   0.25          10.5

Table 2: Number of native bonds forming each protein section, absolute contact order (ACO) and
fraction of long-range (LR) contacts of each protein section.
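The quantities that enter the correlation analysis, and the ordering procedure described above, can be sketched in Python as follows. The sketch assumes that each target conformation is stored as the set of indices of its formed native contacts, and it follows the index convention given with equation (2): `p[a][b]` is the probability that contact b is present when contact a is absent. The correlation matrix itself is not computed here; `greedy_order` takes any precomputed matrix `c` of C values, and it is an assumption of the sketch that each new contact is chosen among those not yet placed. All function names are illustrative.

```python
from typing import List, Sequence, Set, Tuple


def conditional_statistics(sample: Sequence[Set[int]],
                           n_contacts: int) -> Tuple[List[List[float]], List[int]]:
    """Return (p, n_absent): p[a][b] = P(b present | a absent) and
    n_absent[a] = number of conformations in which contact a is absent."""
    n_absent = [0] * n_contacts
    joint = [[0] * n_contacts for _ in range(n_contacts)]
    for formed in sample:
        for a in range(n_contacts):
            if a in formed:
                continue
            n_absent[a] += 1
            for b in formed:
                joint[a][b] += 1
    p = [[joint[a][b] / n_absent[a] if n_absent[a] else 0.0
          for b in range(n_contacts)] for a in range(n_contacts)]
    return p, n_absent


def greedy_order(c: Sequence[Sequence[float]], start: int = 0) -> List[int]:
    """Order contacts so that each one is followed by the not-yet-placed
    contact with the lowest C value relative to it (ties broken by index)."""
    order = [start]
    remaining = set(range(len(c))) - {start}
    while remaining:
        last = order[-1]
        nxt = min(remaining, key=lambda k: (c[last][k], k))
        order.append(nxt)
        remaining.remove(nxt)
    return order
```

Re-indexing the rows and columns of `p` with the list returned by `greedy_order` places strongly correlated contacts next to each other, which is what makes the blocks along the diagonal visible.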

In both geometries, contacts within sections A and C are strongly correlated. This is shown by
the low-probability (i.e. dark) square regions located along the diagonal in the ordered p_αβ and n_αβ
matrices. In the p_αβ plot, such well-defined regions indicate that when a contact belonging
to A (or C) is not formed, any other contact in A (or C) has a considerably low probability of
being formed (Figure 4(a)). Correspondingly, the darker squares identifying sections A and C in
the n_αβ matrices show that for any pair of contacts within those sections, there is a small number
of conformations in which one of the contacts in the pair is formed while the other is not (Figure
4(b)).

In the p_αβ and n_αβ density plots the brighter regions located off the diagonal indicate
that contacts belonging to C (or A) can be formed with a relatively high probability when a contact
in A (or C) is missing. Hence, we conclude that the target conformations have either A or C formed.

Contacts in section B behave differently from those in sections A and C as they are always present
with high probability. This is shown by the existence of the white vertical bar in the p_αβ density
plot (Figure 4(a)) and the dark (and homogeneous) horizontal band that spans the vertical axis in
the n_αβ matrices (Figure 4(b)). The white spots on the diagonal in the p_αβ matrices indicate that,
by contrast to contacts in sections A and C, when one contact within B is missing, other contacts
within B may still be formed with high probability.

Some contacts are located at the boundaries of the identified sections. The correlation between
their presence and the presence of other contacts does not fit the correlation patterns found for sections
A, B or C. For this reason we do not assign them to any section and refer to them as free
contacts. There are five free bonds in Geometry 1 (namely, 4-23, 5-24, 12-33, 13-34 and 25-30) and
two free bonds (2-9 and 13-46) in Geometry 2.

3.3 Section's geometric traits 

The protein sections thus identified as clusters of strongly correlated native bonds form well-defined,
separate parts of the native fold (Figure 5). Indeed, clusters of strongly correlated bonds are grouped
together in the protein's three-dimensional representation. The structural characterization of each
individual section is reported in Table 2.

In Geometry 1 the three sections are geometrically different. In section A, all contacts but one
are long-range and link residues located at opposite ends of the chain. On the other hand, about
50% of the native bonds in section C are local. They connect residues in the middle of the chain
Figure 4: Density plots of the probability (left column) and fraction of conformations (right column)
where contact β is present and α is not for Geometry 1 (top) and Geometry 2 (bottom). Native contacts
are ordered according to their relative values of C_αβ (the order is the same for the p_αβ and n_αβ plots).
The groups of contacts forming sections A, B, and C are identified. Contacts that were not assigned to
any section ('free' contacts) are identified by the letter F. The range of p_αβ lies between 0 (black) and 1
(white), while n_αβ varies between 0 (black) and 0.54 (white) in Geometry 1 and between 0 (black) and
0.64 (white) in Geometry 2.

Figure 5: Protein sections identified for Geometry 1 (top row) and Geometry 2 (bottom row). Native
contacts forming sections A, B and C are respectively colored red, blue and green in the
three-dimensional representations (left) and contact maps (right). Note that the protein sections identified
as groups of correlated native bonds are grouped together in the protein's three-dimensional native
structure.
1.7 ± 0.1
0.13
3.4 ± 0.1
0.11
0.9 ± 0.2
A
0.12
2.7 ± 0.1
0.04
B
0.00
A
B
0.00
C
A
B
0.00


Table 3: Folding pathways at the macro-structural level of section formation (showing the first, second 
and third section to fold) and their relative probabilities of occurrence. The probabilities do not add to 
one, since there are some events in which two sections fold simultaneously. The average time elapsing 
between the formation of the first section and the formation of the last section in each pathway is given 
in units of 100000 MCS. 

(between residues 17 and 34). Contacts forming section B link residues located in the middle of the
chain to residues located at either end of the chain. Interestingly, the geometric features of section
B reflect those of the overall native fold. In Geometry 2, on the other hand, the three sections are
geometrically very similar, being formed essentially by local bonds.

4 Folding pathways 

A folding pathway is an ordered sequence of events (i.e. of conformational changes) observed along
the time coordinate. In this section we investigate whether the previously identified protein sections become
structured by following some preferential order, and how such ordering preferences depend on native
geometry. In other words, we investigate the existence of folding pathways at the macro-structural
level of section formation, and how the latter depend on the native fold geometry. In order to do
so, the fraction of native contacts in each section, Q_S, is monitored during each folding event. A
section is considered folded from the time when its fraction of native contacts Q_S reaches 1.0 until
it decreases below a certain threshold Q_S^U. In other words, the time at which a section folds is the
smallest time t_S such that Q_S = 1.0 at time t_S and Q_S > Q_S^U at all times larger than t_S.

A priori, the threshold Q_S^U could be section specific. However, different values of Q_S^U were tested
for both proteins and the results reported hereafter are robust to changes in the exact value of this
threshold. Therefore, and for the sake of simplicity, each section's Q_S^U was set equal to the cut-off
Q_U used previously to determine the whole protein's unfolded state.

We consider 5000 folding events. For each one the times t_S at which each section folds are
recorded and the corresponding folding pathway is identified. The probability of observing specific
pathways is then computed (Table 3).
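The definition of the section folding time and the bookkeeping of pathways can be sketched as follows. The sketch assumes that each folding event is stored as one time series of Q_S per section, sampled at every MC step and ending when the protein reaches the native state; a value of Q_S equal to the threshold is treated as unfolded, which is a choice made for the example. Events in which two sections fold at the same step are returned as None and still count in the normalization, which is why the probabilities need not add to one.

```python
from collections import Counter
from typing import Dict, List, Optional, Sequence


def section_folding_time(q_series: Sequence[float], threshold: float) -> Optional[int]:
    """Smallest t with Q_S(t) = 1 and Q_S > threshold at all later times."""
    fold_time = None
    for t, q in enumerate(q_series):
        if q <= threshold:
            fold_time = None      # the section unfolded again: discard the candidate
        elif q >= 1.0 and fold_time is None:
            fold_time = t         # first time fully formed since the last unfolding
    return fold_time


def folding_pathway(series_by_section: Dict[str, Sequence[float]],
                    threshold: float) -> Optional[str]:
    """Return e.g. 'ABC' (order of section folding), or None if a section never
    folds or two sections fold simultaneously."""
    times = {name: section_folding_time(series, threshold)
             for name, series in series_by_section.items()}
    if any(t is None for t in times.values()):
        return None
    if len(set(times.values())) < len(times):
        return None
    return "".join(sorted(times, key=times.get))


def pathway_probabilities(pathways: List[Optional[str]]) -> Dict[str, float]:
    """Relative frequency of each pathway over all folding events."""
    counts = Counter(p for p in pathways if p is not None)
    return {p: n / len(pathways) for p, n in counts.items()}
```

Applied to the 5000 folding events, `pathway_probabilities` would yield the kind of frequencies reported in Table 3.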

Interestingly, the most probable folding pathways are those in which section B is the second
to fold. Structurally, this preference translates into folding starting either at the top or at the
bottom of the native structure followed by the consolidation of the structure's middle 'layer' (Figure
5). The next most probable pathways are those where B folds first and, for both geometries, the
probability that B folds last is vanishingly small. These observations suggest that in either case it
is the folding of section B that determines the probability of a folding pathway. We disregarded the
folding events in which two sections fold simultaneously (i.e. in the same MC step) as they result
from the discretization of time and space imposed by the lattice.

For the most probable folding pathways we measured the time elapsing between the formation
of the first section and the emergence of the native structure. For both geometries the shortest time
intervals are observed when section B folds first. However, these time intervals are systematically
larger in the folding of Geometry 2. Here, once the first section is completely formed, the
protein takes on average 25% of the folding time to achieve the native state if it follows the slowest
pathway. For Geometry 1 the equivalent time interval is just 2.5% of the overall folding time. This
feature is particularly interesting because Geometry 2 folds faster than Geometry 1 (Table 1).

5 Section formation as a function of the folding probability

Here we analyze the folding progression of individual sections as a function of the probability to fold,
P_fold. In other words, we investigate how the different sections of the protein become structured,
i.e., how their fraction of native bonds, Q_S, evolves along the folding reaction. In order to do so,
two ensembles, each comprising 8000 conformations, were considered for each native geometry and
the folding probability of each conformation evaluated (section [33} Q. The probabilities of having a 
section with fraction of native bonds Q_S as a function of P_fold are shown as density plots in Figure 6.

We start with the analysis of Geometry 1. Here, section A is essentially unfolded for most
of the folding reaction. Indeed, up to P_fold ≈ 0.8, the most probable conformations are those
with fraction of native bonds Q_A ≈ 0.1, and it is only when folding is near completion that the
probability of finding A folded or close to folded (i.e., with Q_A > 0.9) is non-zero. Due to their local
nature, bonds in section C can break and form more easily than in other sections where non-local
bonds abound. It is perhaps for this reason that Q_C is distributed rather uniformly in the range
0.1 < Q_C < 0.75 up to late folding stages (i.e., up to P_fold ≈ 0.8). It is only when P_fold > 0.9 that
there is a significant group of conformations with more than 90% of section C folded.

While there is no correspondence between the behavior of P_fold and that of the fraction of
native contacts formed in A and C - in the sense that higher (lower) P_fold does not necessarily
imply higher (lower) Q_S - for section B, on the other hand, Q_B is on average high at high P_fold,
while early on in folding (at low P_fold) section B is essentially unfolded. Therefore, an increase in
P_fold typically leads to an increase in Q_B, suggesting that the folding of section B acts as a driver
for the folding of the whole protein.

For Geometry 2 the folding scenarios of sections A and C are rather distinct from those found
in the more complex Geometry 1. Indeed, for Geometry 2, the probability of finding sections A
and C with fraction of native bonds Q_S is strongly bimodal for any P_fold. This means that at any
stage of the folding reaction it is possible to find conformations with either A or C almost folded
(peak at high Q_S) and others where A and C have very little structure (peak at low Q_S). This
observation agrees with our previous findings regarding the most probable folding pathways, where
folding initiates at A (and C folds last) or, conversely, starts at C (and A folds last). However, as
with Geometry 1, the fraction of native bonds of section B increases with P_fold and, when it achieves
some critical value, it becomes large enough to prompt folding of sections A or C.

To gain further insight into the folding reactions of both model proteins we have determined how
the average fraction of native bonds in each section, ⟨Q_S⟩, changes with P_fold (Figure 7).

The average fraction of native bonds in sections A and C decreases considerably when folding
of Geometry 2 is near completion (Figure 7(b)), which suggests that unfolding events occur at
large P_fold. This presumably happens due to a partial or complete folding of both sections A and C
prior to the complete folding of section B. Such unfolding events, which are required to ensure that
folding follows the right pathway, do not occur to such an extent in Geometry 1, where A and B
cannot fold at low P_fold (Figure 7(a)), perhaps due to topological constraints. A comparison of the
data reported in Figures 7(a) and 7(b) with that shown in Figure 7(c) indicates that the folding of
the whole protein follows the folding of section B, in agreement with the idea that section B drives
the folding of the whole native structure.

The standard deviation σ_Pfold was also measured. Hence, the probability for a conformation Γ to have some P_fold
is considered to be given by the Gaussian distribution with average P_fold(Γ) and standard deviation σ_Pfold(Γ). These
Gaussian distributions are used as weighting terms for calculating the probabilities of having a section with fraction of
native bonds Q_S as a function of P_fold.
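One way to implement the Gaussian weighting described in this note is sketched below. Each conformation contributes to every P_fold bin with a weight given by a Gaussian centred on its measured P_fold. Several details are assumptions of the sketch, not statements from the text: the binning of Q_S through `q_edges`, the floor `min_sigma` applied when the measured standard deviation is zero, and the normalization of each P_fold bin to unit total probability.

```python
import bisect
import math
from typing import Iterable, List, Sequence, Tuple


def gaussian_weighted_density(conformers: Iterable[Tuple[float, float, float]],
                              pfold_centres: Sequence[float],
                              q_edges: Sequence[float],
                              min_sigma: float = 0.01) -> List[List[float]]:
    """conformers: (mean P_fold, std of P_fold, Q_S) for each conformation.
    Returns density[i][j]: probability of Q_S bin j at P_fold bin centre i."""
    n_q = len(q_edges) - 1
    density = [[0.0] * n_q for _ in pfold_centres]
    for mean_pfold, sigma, q_s in conformers:
        sigma = max(sigma, min_sigma)   # avoid division by zero (assumed floor)
        j = min(max(bisect.bisect_right(q_edges, q_s) - 1, 0), n_q - 1)
        for i, centre in enumerate(pfold_centres):
            z = (centre - mean_pfold) / sigma
            density[i][j] += math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
    for row in density:
        total = sum(row)
        if total > 0.0:
            row[:] = [value / total for value in row]
    return density
```

With narrow Gaussians a conformation contributes almost exclusively to the P_fold bin closest to its measured value, and the result reduces to an ordinary two-dimensional histogram.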
Figure 6: Density plots of the probability for having a certain Q_S as a function of P_fold for the sections
A, B and C in Geometry 1 (top) and Geometry 2 (bottom).

Figure 7: Average fraction of native bonds in each protein section, ⟨Q_S⟩, as a function of P_fold in Geometry
1 (a) and Geometry 2 (b). Also shown is the dependence of the protein's average fraction of native bonds
on the reaction coordinate, P_fold, for both geometries. Note that when folding is near completion at
high P_fold there is a sharp increase in the fraction of native contacts for Geometry 1.

6 From macro- to micro-structural formation: Evidence for nucleation phenomena

A post-critical folding nucleus (FN) is defined as a set of native bonds which, once formed, prompts
rapid and highly probable folding [11]. We have recently developed a methodology, based on the
concept of folding probability, aimed at identifying critical (i.e. nucleating) bonds in the folding of
small lattice proteins [38]. In a related effort, a simulational proxy of the φ-value analysis was used
to identify nucleating residues in the folding of the two model proteins investigated in the present
work [10]. We have found that residues 6, 33 and 35 in Geometry 1, and residues 19, 20
and 29 in Geometry 2, lead to the largest increase in folding time upon mutation. Interestingly, the
vast majority (i.e. more than 60%) of contacts formed by these residues are present in section B of
both proteins.

The conclusion that section B encapsulates the nucleating residues (and therefore the set of
native bonds forming a post-critical FN) can actually be drawn from an independent analysis of
the results obtained so far. In addition to shedding light on the existence of protein sections, the
contact correlation analysis introduced in section 3.2 shows that the folding of section B is a
pre-requisite to observe inevitable (i.e. highly probable) folding of the whole protein. Indeed, section
B is always folded in the high-P_fold conformations that are on-pathway to the native state (i.e.
that fold fast). Therefore, if these model proteins fold via nucleation, section B must necessarily
contain the critical residues forming the FN. In support of this argument we have found an increase
in the correlation between Q_B and P_fold for both geometries when folding is near completion (i.e.
P_fold > 0.85), which implies that the folding of section B determines the inevitable folding of the
whole protein. For example, in Geometry 1, a conformation where section B is folded has folding
probability P_fold > 0.93. Also illuminating is the fact that the presence, with high probability, of the
bonds forming section B is independent of which other bonds are formed in the protein. Moreover,
the most probable folding pathways are those in which section B folds early, while those in which it
folds last have a vanishingly small probability of occurrence.
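
The correlation argument above can be illustrated with a minimal Python sketch. It assumes (hypothetically) that each sampled conformation is stored as a collection of formed native contacts, given as residue-index pairs, together with an estimate of its folding probability. The function names, the data layout and the use of the Pearson coefficient are illustrative assumptions and are not taken from the original analysis.

```python
from math import sqrt


def section_fraction(formed_contacts, section_contacts):
    """Fraction of a section's native contacts that are formed (Q_s)."""
    section = set(section_contacts)
    if not section:
        raise ValueError("section has no native contacts")
    return len(section & set(formed_contacts)) / len(section)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for zero variance")
    return sxy / sqrt(sxx * syy)


def correlation_above(conformations, section_contacts, p_min):
    """Correlation between Q_s and P_fold over conformations with P_fold > p_min.

    `conformations` is a list of (formed_contacts, p_fold) pairs
    (a hypothetical layout chosen for this sketch).
    """
    selected = [
        (section_fraction(contacts, section_contacts), p_fold)
        for contacts, p_fold in conformations
        if p_fold > p_min
    ]
    q_values = [q for q, _ in selected]
    p_values = [p for _, p in selected]
    return pearson(q_values, p_values)
```

Calling `correlation_above(conformations, section_b, 0.85)` on the sampled ensemble, and comparing the result with the value obtained for a lower threshold, would be one way to check whether the correlation between Q_B and P_fold grows near the end of folding.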

7 Cooperativity at the level of macro-structural formation

In protein folding the term cooperativity is generally used in connection with specific thermodynamic
and kinetic features exhibited by small, single domain proteins. Indeed, extraordinary experimental
traits such as the linear chevron behavior (kinetic cooperativity), and the verification of the van't
Hoff criterion (thermodynamic cooperativity) have typically been ascribed to the existence of
highly unusual energetics involving non-additive multi-body interactions [39, 41].

The results reported in this work suggest that Geometry 1 folds in a more cooperative
manner than Geometry 2. This difference in cooperative behavior is particularly evident from the
study of section formation along the reaction coordinate (Figure 7). Here, it is shown that both
sections A and C in Geometry 2 can have the vast majority of their bonds formed (Q_s > 0.9) early
on in folding (i.e. at low P_fold). In Geometry 1, on the other hand, the formation of bonds within
one section does not happen in such an independent manner. Indeed, it is only when folding of
the overall protein is near completion (i.e. for P_fold > 0.9) that the fraction of bonds within each
section comes close to unity. Also suggestive of the more cooperative behavior of Geometry 1 are
the considerably shorter times elapsing between the formation of the first section and the folding of
the whole protein ([3]). Indeed, not only are these time intervals considerably shorter in Geometry
1 than in Geometry 2, but they are also (on average) 33% shorter than the cut-off time that was used to
select the conformations that fold inevitably fast from other high-P_fold conformations (section 3).
For Geometry 2 such time intervals are similar to this cut-off parameter and much larger than the
average folding time of on-pathway conformations with P_fold > 0.9. These times are in line with the
finding that the first section to fold can do so relatively early during the process (i.e. P_fold ≪ 0.9).
Finally, the higher cooperativity of Geometry 1 is also evident from the sharper increase in the
fraction of native contacts, Q, that is observed near the very end of its folding process (Figure 7(c)).
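
Curves of the kind discussed in this section (an average fraction of native bonds as a function of the reaction coordinate) can be built by grouping sampled conformations into bins of folding probability. The Python sketch below is one simple way to do this. The equal-width binning, the function name and the `(p_fold, q)` sample layout are assumptions made for illustration, not the procedure used in the original work.

```python
def average_q_by_pfold(samples, n_bins=10):
    """Average fraction of native bonds in equal-width P_fold bins on [0, 1].

    `samples` is an iterable of (p_fold, q) pairs, where q may be the
    fraction of native bonds of a section (Q_s) or of the whole protein (Q).
    Returns a list of length n_bins; empty bins are reported as None.
    """
    if n_bins < 1:
        raise ValueError("n_bins must be positive")
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for p_fold, q in samples:
        if not 0.0 <= p_fold <= 1.0:
            raise ValueError("P_fold must lie in [0, 1]")
        # P_fold = 1 is assigned to the last bin rather than to a new one.
        index = min(int(p_fold * n_bins), n_bins - 1)
        sums[index] += q
        counts[index] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]
```

In such a profile, a section that reaches Q_s > 0.9 already in the low-P_fold bins would indicate independent formation, whereas values approaching unity only in the last bins would indicate the more cooperative behavior described for Geometry 1.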

8 Conclusions 

In the present work we investigated the existence of folding pathways for two model proteins differing
in native geometry at a coarse-grained level of structure formation. To this end we developed and
applied a methodology, based on native contact correlation analysis, which identifies protein sections
with clusters of highly correlated native bonds. The latter were shown to map onto well-defined
three-dimensional structural domains within the native fold.

Three protein sections and four folding pathways, corresponding to different ordering preferences 
of section formation, were identified for each protein. Interestingly, the analysis of folding pathways 
at a macro-structural level of structure formation revealed a common underlying folding mechanism, 
based on nucleation phenomena, for both target geometries. Indeed, our results show that one of 
the protein sections contains a set of critical bonds that form the folding nucleus. In the more
complex geometry this section, and the folding nucleus, have a topology similar to that of the native
fold [13113]. 

Despite these similarities, we identified a relevant difference between the folding processes,
related to the different cooperative behavior of the two proteins. The higher cooperativity observed
for the more complex geometry is probably due to the larger number of non-local, long-range native
bonds in the native fold as well as in the folding nucleus [27]. In other words, the higher cooperativity
of the folding process of the complex geometry is ascribed to the non-trivial order of the native fold,
which is mimicked by that of the folding nucleus. Despite the small size of the two model proteins,
this structural difference has a marked effect on the dynamics of the folding process, and for the
complex geometry it resembles the dynamics of first-order transitions in the thermodynamic limit.
Quantitative measures of cooperativity, and in particular the size dependence of the nucleation
barrier for the different geometries, are outside the scope of this work [44].

We speculate that introducing chemical specificity in our model proteins will change the number, or
at least the probability of occurrence, of the folding pathways identified here, which are driven
solely by native geometry. The use of a sequence-specific model (e.g. using
the Miyazawa-Jernigan potential) is, however, outside the scope of the present study and will be
investigated in future work.